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ABSTRACT 

In this paper we introduce a generic model for multiplicative 
algorithms which is suitable for the MapReduce parallel pro- 
gramming paradigm. We implement three typical machine 
learning algorithms to demonstrate how similarity compari- 
son, gradient descent, power method and other classic learn- 
ing techniques fit this model well. Two versions of large- 
scale matrix multiplication are discussed in this paper, and 
different methods are developed for both cases with regard 
to their unique computational characteristics and problem 
settings. In contrast to earlier research, we focus on fun- 
damental linear algebra techniques that establish a generic 
approach for a range of algorithms, rather than specific ways 
of scaling up algorithms one at a time. Experiments show 
promising results when evaluated on both speedup and accu- 
racy. Compared with a standard implementation with com- 
putational complexity 0{m'^) in the worst case, the large- 
scale matrix multiplication experiments prove our design is 
considerably more efficient and maintains a good speedup 
as the number of cores increases. Algorithm-specific exper- 
iments also produce encouraging results on runtime perfor- 



Categories and Subject Descriptors 

D.i [Programming Technique]: General; F.l [Analysis 
of Algorithms and Problem Complexity]: General 

General Terms 

Algorithms 

Keywords: Multiplicative Model, Machine Learning, MapRe- 
duce 

1. INTRODUCTION 

In order to deal with increasing dataset sizes. Machine Learn- 
ing algorithms are required to be implemented in very large 
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scale. However, upscaling learning algorithms is not always 
straightforward. In recent years, the MapReduce paradigirQ 
[sj and its open-source implementation Hadoop [T^ has drawn 
increasing attention from industry for its remarkable ca- 
pability of processing large-scale data and straightforward 
functional programming representation. Questions have been 
raised that whether MapReduce paradigm can scale up learn- 
ing algorithms in a succinct fashion that a wide range of 
learning algorithms can benefit from. 

Among the major algorithms, similarity-based classification 
approaches occupy a dominant position in the Machine Learn- 
ing research field. Earlier attempts have made a variety of al- 
gorithms in this kind available for the MapReduce paradigm. 
Solutions to several key models including fc-Means and K- 
Nearest Neighbor (KNN) have been proposed at an early 



stage and good accelerating performances were reported 17 
|15| . Individual implementations are also documented for 
Support Vector Machines on Graphics Processing Units (GPUs) 
[Sj and Locality sensitive hashing (LSH) for Google News 
Personalization [t]. Because of this broad interest, two sim- 
ilarity and distance-based learning algorithms are selected 
as our upscaling target. 

Efforts have also been made for establishing a generic model 
that solves several algorithms at the same time. Gillick et 
al. grouped several machine learning algorithms into three 
categories: Single Pass, Iterative, and Query-based [9]. Re- 
cent work has been done by Chu et al. [s] where 10 different 
learning algorithms have been introduced. Inspired by their 
study, it is realized that a generic model of upscaling several 
Machine Learning algorithms may be found, and the imple- 
mentation work can be greatly reduced, thus, discovering 
a generic model of several learning algorithms is our main 
interests in this paper. 

Liu et al. proposed a methodology for web-scale Non-Negative 
Matrix Factorization in [12] using a multiplicative approach 
as described by Lee and Seung [ll]. An interest has been 
brought that two multiplication models can be generalized 
from their implementation for scaling up other learning al- 
gorithms, and it is also an inspiration for us to adopt mul- 
tiplicative methods on our targeting learning problems. 



The primary goal, therefore, of this research is to upscale a 
range of machine learning algorithms including Non-Negative 
Matrix Factorization (NMF), Support Vector Machines (SVM) 
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and PageRank by utilizing a generic multiplicative model on 
Map Reduce paradigm. 

The paper is organized as follows. Related works are dis- 
cussed in the next section, and we then introduce the prob- 
lem settings and solutions of these three algorithms, after 
which a theoretical study will be adopted for extracting mul- 
tiplicative models from three algorithms which defines our 
core problem in the next section. The methodologies em- 
ployed for extending these algorithms in large scale will be 
illustrated in two parts: a) common computation compo- 
nents parallelization and h) algorithm-specific settings opti- 
mization. Results from a wide range of experiments follow, 
and a brief conclusion is summarized in the final section. 

2. RELATED WORK 

In recent years, implementing Machine Learning algorithms 
on MapReduce Paradigm have been widely discussed in lit- 
eratures. Having been proposed in 2004, MapReduce is be- 
lieved to be the large-scale parallel data processing engine in 
Google for a wide range of services (e.g. webpage indexing 
and page repository hosting) [s]. [t] reported that the News 
Service in Google also takes suggestions from learning algo- 
rithms running on MapReduce. However, there is no direct 
evidence published by Google showing how PageRank [2] is 
operating with MapReduce. 

Efforts by individuals or groups other than Google have also 
contributed a number of ideas for running Machine Learn- 
ing algorithms on MapReduce. One of the most recent ef- 
forts by Liu et al. indicates a novel way of upscaling Non- 
Negative Matrix Factorization on MapReduce by using the 
multiplicative method where the iterative update approach 
described in [Tl] is adopted as a series of matrix multipli- 
cation. Several multiplication strategies are developed at 
different stages. In order to balance the load of servers and 
maximize the parallelization, a partitioning strategy is per- 
formed for large matrices. However, partitioning a huge ma- 
trix into single rows/columns and combining into multiplica- 
tive permutations may consume considerible computational 
resources. But considering the problem setting they need to 
handle (extremely sparse matrices), it is acceptable in most 
cases. In contrast, a more generalized plan is illustrated in 
section [5] 

Rather than upgrade one single individual algorithm at one 
time, generic frameworks for Machine Learning algorithms 
are also discussed in [9j |5] . Gillick et al. investigated a tax- 
onomy of standard machine learning algorithms, and data 
processing patterns were taken into primary consideration 
which leads to three major groups of algorithms: Single 
Pass, Iterative Learning and Query-based Learning, reveal- 
ing both advantages and limitations of the three learning 
paradigms. It is realised immediately after this research 
that a large set of algorithms can be phrased as MapRe- 
duce fashion by following the same pattern. Their work also 
illustrates the benefit that simplified MapReduce program 
representations offer to the Machine Learning community. 

The same year, a larger collection of algorithms have been 
implemented on MapReduce by means of the Statistical Query 
Statistical query learning uses statistical 



perform noise-tolerant learning. Given such theory and an 
objective function, learning algorithms are typically opti- 
mization algorithms that can be written in a summation 
form which naturally fits the MapReduce paradigm. Al- 
gorithms first calculate the sufficient statistics and gradient 
from a statistical query oracle and then aggregate them over 
all data points, thus datasets can be distributed among cores 
and the Map function is responsible for examining partial 
gradients while the Reduce stage checks through all Map- 
generated data for aggregation. While this method forms 
the foundation of the iterative update implementation in 
our study, it does not use statistical queries but generally 
borrows the idea of "Summation Form" to calculate the gra- 
dients directly from individual examples. 

SVM implementation, as a notable exception, has been il- 
lustrated and adapted efficiently on MapReduce for Graph- 
ics Processors, but failed to migrate to PC clusters since 
generally the traditional Sequential Minimal Optimization 
(SMO) [l3| method may require more than ten thousand 
iterations on a medium sized dataset [s] . This iterative pro- 
cess causes significant start-up overhead for general Hadoop 
PC clusters. Chu et al. also proposed their own SVM im- 
plementation under summation form, however, they fail to 
explain how to handle a gigantic kernel matrix for large-scale 
dataset. In this paper, we believe. Quadratic Programming, 
the naive form of SVM can be constructed by using two 
models extracted from the previous research reported in [5] 
and [12]. 

3. ADAPTED ALGORITHMS AND SOLU- 
TIONS 

In this section, definitions of three learning problems with 
their typical solutions are given. 

3.1 Non- Negative Matrix Factorization 

The definition of NMF is as follows: 

Definition 1. Given A € M.+"^^" and a positive integer 
k < min{m,n) , find a factorization of A into W G K-l-™'^* 
and H £ R-l-''^", such that divergence function I?(A||A) is 
minimized, where A = WH is the reconstructed matrix from 
the factorization, and the divergence function is defined as 

D(A||A) = ^(A,,, - A,,,)' = ||A- WH||" 



From the probabilistic view, NMF methods can be divided 
into different types in which each element Aij is an obser- 



vation from the distribution whose mean is At 



In this 



paper, we only consider one popular NMF, Gaussian NMF 
(GNMF) by taking 

Aij ~ Gaussian(Aij^,a^) 



GNMF is solved by Lee and Seung [TT], using a multiplica- 
tive approach: 
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properties of the data rather than individual examples to 



under which the divergence ||A — WH||^ is non-increasing 
after each update. 



3.2 Support Vector Machines 

The one-norm soft-margin SVM with fixed bias can be de- 
fined as: 

Definition 2. 

maximize W{a) = — ^ 5^ 2/jyjQ!iQj A'(xi, xj) 

subject to < fti < C and i = 1,2, . . . ,1 

This definition uses the fixed bias so that the constraint 
Yl^^i OiiVi = does not need to be explicitly included (for a 
formal definition please refer to [14| ). 

According to the definition |2] it is equivalent to minimize: 
I I 
-2^0!,; -I- ^ 2/,yjQ,aj/4:(xi,Xj) 

we set the partial derivatives wrt. a to 0, so that 

I 

-2j/,-H2^a,K(xi,Xj) = 

or 

Gq = y 

where Gij = K(xi,Xj). The optimal solution of parameter 
a* is then given by 

a = G-'y (2) 

and can be obtained by iterative gradient descent approach 
which will be introduced in section f42l 

3.3 PageRank 

PageRank |2: follows a recursive definition as follows: 
Definition 3. 

PjeA/{p,) 

where pi,p2,...,Pn are PageRanks for webpages, d is a 
damping factor between and 1 which simulates how quickly 
a "Random Surfer" is getting tired during surfing, A'^ is the 
total number of webpages, while L{pj) is the total number 
of outlinks for a single webpage. 

PageRank de facto represents the eigenvector for a stochas- 
tic matrix in a Markov chain with its maximal eigenvalue, 
1. This problem can be solved by a very effective approach 
called "Power Method". For a transition probability matrix 
P of a directed graph G, and vr donates the stationary prob- 
abilities of Markov Chain, so vr satisfies: 

TV = PtT 

Moreover, the tt is the principal eigenvector of matrix P, 
with its maximal eigenvalue 1. The stationary probabilities 
TV can be obtained by power method [T] which employs the 
iterative multiplication as follows: 

7r''+' = Ptt'' (3) 



4. MULTIPLICATIVE MODELS 

In order to parallelize these three algorithms by means of 
a generic approach, two types of common "multiplicative 
components" are extracted from given solutions. 

4. 1 Similarity Comparison and Distance Com- 
putations 

Similarity comparisons (e.g. dot product and distance com- 
putations) is a general calculation involved in a variety of 
learning algorithms. Dot product for two matrices with rows 
in the form of vectors, can be proceeded using the following 
matrix multiplication and transposition: 

A B = AB^ 

According to ([l]), each NMF update requires the multiplica- 
tion of two large matrices for similarity comparison. Both 
AH'^ and HH''^ calculate the inner products of n-dimension 
vectors on "column features" of matrix A. Similarly, W^A 
and W'^W gives the similarity measure of m-dimension vec- 
tors on the "row features" of matrix A. Similar story can 
also be found in ([2| where kernel matrix can also be formed 
by multiplying two matrices with training vectors and its 
transposition. 

Euclidean distances can be handled in the same way as both 
Euclidean distance and dot product can be written in "Sum- 
mation Form". 

This type of matrix multiplication is characterized by its 
large high-dimension of input matrices and high-density of 
output matrix which cause severely storage problems. In 
the next section, we will demonstrate how these problems 
can be avoided in our implementation by using high storage 
capacity and data locality feature of MapReduce. 

4.2 Gradient Descent and Power Iterative Method 

For many optimization problems, the aim is to learn a pa- 
rameter vector 6 from a linear system generalized A and a 
sequence of observations y. In general we have y — 6'^A 
and thus 0* = A~''"y. However, matrix inversion is com- 
putationally costly, and instead methods such as gradient 
descent are used. Addressed in [s], these algorithms can be 
adapted to "Summation Form" as well. 

As a typical Quadratic Programming (QP) problem, SVM 
can be represented in this form, and solved by a gradient 
descent method, in which each single parameter at can be 
updated by an increment: 

. t 9W(a') , T^/ N ,N 

' = '^^dci^ — = '7(-yi2^yjajK(xi,Xj) -f- 1) 

Particularly, the gradient G = AW(q) of W with respect 
to a can be expressed by linear algebra: 

G = 7). * (-y. * DK 1) (4) 

where D is a vector D — {Di, D2, ■ ■ ■ , Di}, and Di = yiOi. 
Obviously, for multiplying a large dense matrix K, the com- 
putational complexity of Q is dominated by the matrix 
multiplication DK, which suggests us this component can 
greatly benefit from parallelized implementation. 



Similarly, Power Method demonstrated in ([3| shows all the 
calculation in one iteration can be done by a simple mul- 
tiplication. Common component extracted from these two 
algorithms can help us to handle these specific types of com- 
putations on large data. 

In this category of matrix multiplication, two operands often 
varies in size. For instance, SVM gradient descent involves a 
large and dense kernel matrix multiplied with a much smaller 
column vector, so that the parallelism adopted in similarity 
comparison multiplication cannot be used in this case for 
efficiency consideration. 

5. METHODOLOGY 

In this section, we demonstrate how these two types of ma- 
trix multiplication can be adapted on MapReduce paradigm. 

5.1 General Matrix Multiplication 

It has been proved that Partitioning is an efficient solution 
to large-scale Matrix Multiplication on MapReduce 12 . We 
further generalise their approach by adopting the classical 
block matrix multiplication method. 

The typical method for distributed matrix multiplication is 
to use block matrix multiplication in which each operator 
matrix is partitioned across row or column, so that a large 
computation problem be divided and conquered. 



the first job is responsible for partitioning and grouping ma- 
trices, and the second concentrates on multiplying two sub- 
matrices and summing up the intermediate results generated 
from the earlier stage. Algorithm details are shown in Algo- 
rithms [ij [2] |3] and |^ 

Algorithm 1 Partition Mapper (for matrix A) 

Require: row ai G {ai, a2, . . . , ab} 
Require: partition schema {m, n, k} 

a i/m 

if a > m then 

end if 

step b/k 

for i = 1 to 6 step step do 
start i; end <— z + step 
7 i/step 
if 7 > n then 

gamma n 

end -f- h 
end if 

sub <~ subvector [start, end, ai) 
for P = 1 to k do 

emit{{a, jS, 7, i), sub) 

end for 

i i -I- 1 
end for 



Definition 4. For a matrix multiplication C = A x B, 
where A G K"'"' and B G K''^'=, and C G , a Partition 
Schem,^^m,n,k can he introduced so that A and B can be 
partitioned as 



Al,n' 



B 



Bi. 



B, 



Therefore, the result matrix C is also a block matrix with m 
row partitions and k column partitions, and each block Ca,i3 
can be obtained by 



7=1 



where a = 1 . . .m,j — 1 . . .n,/3 ■ 



.k 



5.1.1 Partition-Summation Process 
We summarize the algorithms described in definition [4] as 
a two-stage process illustrated in Figure [1] The first stage 
reads input A, B and outputs partitioned sub-matrices grouped 
by their "Partition Identifiers" (a, (3, 7, i) while the second 
stage performs the actual multiplication of partitioned ma- 
trices and sums them up into the final result. Partition 
Identifier a, /3 shows which part of the summation result be- 
longs to in result matrix while the 7 is used for identifying 
the sub-multiplication groups. In practice, this procedure 
is often split into two cascaded MapReduce jobs, in which 



^Hadoop also uses the word "Partition" to represent the idea 
of "Shard" we discuss further down. However, it has no 
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Algorithm 2 Partition Process (Reducer) 
Require: partition group identifier (a, /3, 7, i) 
Require: individual rows {Aj} G As„6, {Bfc} G 'Bsub 
emit{i, {Asub, Bsut}) 



5.1.2 Partition Schema 

Partition schema often has a very significant impact on per- 
formance, for example, increasing m and k duplicates the 
partitioned matrices for more sub-multiplication groups, while 
increase n may generate more intermediate results after par- 
tition multiplications are calculated. Generally, the compu- 
tational complexity of this approach is 0{m x n x k), how- 
ever, considering the sparsity of A and B, the computational 
complexity is often lower in practice. 

5.1.3 Sharding, Hashing and Computational Local- 
ity 

In MapReduce, communication cannot be made between 
nodes except the "Shuffiing" stage , at which step interme- 
diate results generated from Map Stage are transferred to 
the nodes referring to the Computational Locality (i.e. 
the place where their final computation will be made). Each 
piece of intermediate results grouped by computational lo- 
cality is called Shard in MapReduce. 

In this case, locality can be maximized when sub-matrices 
are multiplied where the summation operation will be pro- 
ceeded, so that all computation can be done without data 
transfer after partitioning. When each shard is emitted from 
its original Mapper, a function h determines where each 
piece of shard is going to be located. Using each Partition 



Partitioning 






a = l,p = 1 


a = l,p = 2 




a = m,p = k 


7=1 










7 = 2 


412XB21 






J^m2^B2i, 












y = n 











Summation 



re = 1 


H = 2 












ft n 



Figure 1: General Matrix Multiplication with Par- 
titioning and Summation process 



Identifier as its input, a naive form of h can be written as: 

h„aive{ce, P,'y) = a mod p 

where a = 0, ...,m — l,/3 = 0, — 1,7 = 0, ...,n— 1 

and p denotes the number of computing nodes. 

However, in some cases, where p 3> m, sharding w.r.t a may 
cause severe computational unsaturation. An improved form 
of hrand cau be introduced as: 

hrand{a,P,^) ^ hash{a,l3,^) mod p 

where the function hash calculates hash of all three param- 
eters, and guarantees uniformity for its result, such that 
shards and computation can be equally distributed to each 
machine. Unfortunately, using this function, data locality is 
violated, since this sharding policy depends on all its three 
parameters rather than a itself, thus, a secondary shuffling 
may be triggered before the summation stage. 



Algorithm 3 Summation Process (Mapper) 

Require: partition group identifier (a, /3, 7, i) 
Require: aggregated sub matrices {Aa^;,, B^ui,} 

for each row c G AsubBsub do 
emit{i, c) 

end for 



Algorithm 4 Summation Process (Reducer) 

Require: row index i, partial results C G {cn, Ci2, ■ ■ ■ ,Cin} 

Ci ^ 

for each c G C do 

Ci = Ci + c 

end for 

emit(i, Ci) 



5.2 Iterative Multiplicative Update 

Compared with the earlier approach, the implementation for 
this update is much simpler and straightforward. Since the 
matrix B is often small enough to fit into memory and is 
very convenient to be transmitted among machines. In this 
implementation, they are distributed to each machine before 
each iteration, so that during the computation process, no 
communication is required. Therefore, the matrix A can 
be split into random pieces as long as each entire row is 
preserved. For each row vi in segment of matrix A, row ci 
in final result can be obtained by 



= riB 



(5) 



The mechanism of "Distributed Cache" provided by Hadoop 
can be regarded as a possible approach for distributing small 
matrix B among computing nodes. This functionality allows 
servers only load matrix once when initiating the calculation, 
and no communication is required afterwards. 

5.3 Algorithm- Specific Settings 

Although two general models have been extracted from three 
algorithms, for each algorithm, specific settings still need to 
be discussed for performance optimization. 



5.1.4 Partitioning Strategy 

The sparsity maybe the first factor that should be taken into 
account since the sparsity of matrices output may largely 
affect the efficiency of the "Shuffle" stage where the interme- 
diate results are combined and aggregated. As noted before, 
for partition schema m, n, k partitioning duplicates original 
matrices in m x fc times, and the size of intermediate results 
before final summation is n times of final output so that for 
a large and dense matrix, the size of both partition groups 
and intermediate results may be too large to be transmitted 
via network. 

The second concern is the profile (e.g. width and height) 
of matrices being multiplied. Generally, the number of par- 
tition should be as small as possible so that not too many 
partition groups are generated and small amount of interme- 
diate results are emitted, however, the number of partition 
should be large enough so that computational tasks can be 
distributed equally, and parallel computing can be fully uti- 
lized. 



5. 3. 1 Non-Negative Matrix Factorization 
Shown by Definition [T] the essence of NMF is to decom- 
pose one large matrix as a multiplication of one "tall" ma- 
trix (m ^ k) and one "fat" matrix [k <^ n), according to 
([1]), solution of NMF involves both types of multiplicative 
models. It follows that AH'^, HH'^, W'^A and W'^W 
can be classified as "Similarity Measure Multiplication" as 
mentioned above and considering the size and sparseness 
of different operands matrices, optimization can be made 
when the unique profiles of these matrices are utilized. Since 
these four multiplications are symmetric, for simplicity, only 
W'^A and W'^W are discussed in this section. 

Typically, W'^A handles one large sparse matrix A G R™^" 
and a fat matrix W''^ G R*"^™, and generates a dense fat 
matrix where k <^ m,n. A sensible partition schema should 
avoid splitting across columns(the shorter edge) of W'^. In 
contrast partition on rows(the longer edge) is a good plan, 
since a entire row may be simply too large to be fit to mem- 
ory. Besides, splitting across columns may only have little 



effects on performance but largely increases the partitioning 
workload. 

Similarly, the partition schema of W'^ W can also be formed 
when considering factors listed above. Matrix W G u™x*: 
and its transposition which are "tall" and "fat" matrices re- 
spectively, multiply and generate a very small matrix C £ 
j^fexfc Pqj. reason stated above, splitting across rows of 
W''^ is preferred, while each column vector of W'^ or row 
vector of W should be remained as a unity. 

In order to compute Y — W'^^WH, our second multiplica- 
tive model can be adopted when the small matrix C has 
been obtained from W^'^W. In normal cases, it is feasible to 
distribute C among machines. Several minor changes have 
to be conducted before this calculation can be fit into |[5|. 
To compute each row Yi G Y, entire row Ci G C and col- 
umn Hj G H must be accessible for the machine where Yi 
is going to be calculated. However, traditional row-based 
file format only allows us to retrieve matrix by row. This 
problem can be solved by transposing matrix H, and each 
column Yj G Y can be obtained by computing Yj = HjC^, 
where Hj is read from transposed matrix H''^. In fact, since 
each single column Yj is sequentially stored in final output, 
we actually compute the Y'^ instead of Y. 

5.3.2 Support Vector Machines 

Kernel computation is a typical pairwise similarity compar- 
ison that stores results in the form of kernel matrix which is 
a large dense matrix grows quadratically with size of train- 
ing set. Training set T G R"'^'' is formed as a matrix where 
each training example T, is represented as a vector with b 
dimensions. Linear kernel K G ffi"'*° can be calculated as 

K = TT'^ 

where T is usually a sparse matrix. 

Since both a and b may be large in practical, partition 
schema may take both dimensions into account. The real 
challenge is the large data storage before the final result is 
summed. Because of the density of matrix K, the size of 
intermediate results may grow linearly when the number of 
partitions on b grows, however, although partition on a may 
also duplicate partitioned groups, matrix T is usually very 
sparse so that duplication may have less space requirement. 
In conclusion, a larger m, k is preferred to a larger n in this 
case. 

The SVM training process can be conducted by iterative 
multiplicative process (see Q). 

5.3.3 PageRank 

As mentioned above, the power method Q can be expressed 
very well using iterative multiplicative model, no specific en- 
gineering on our framework is needed. However, there may 
be some issues when the "Damping Factor" is introduced. 
Limited by space, no further related discussion will be made 
in this paper. 

6. EXPERIMENTS AND RESULTS 

Experiments in this research are conducted in two phases: 
a) experiments of similarity multiplication, b) experiments 
of parallel algorithms. 



In the first phase of our experiments, we focus on matrix 
multiplication in order to explore the potential optimal set- 
ting among a wide range of parameters and reveal the re- 
lationship between time performance and the size, sparsity, 
and partitioning strategy of input matrices. The final ob- 
jective is to form a practical guide for choosing optimal pa- 
rameters. 

The next stage of the experiment is designed for the adapted 
algorithms. These experiments are conducted for two pur- 
poses: a) verifying the correctness of our implementations, 
b) illustrating performance impact brought by paralleliza- 
tion. Evaluations and analyses will be given after each group 
of experiments respectively. 

6.1 Experimental Environment 

As a mature MapReduce implementation, HadoofJ^ is se- 
lected as our experimentation platform, and algorithms are 
implemented in Java under JDK 1.6 runtime. All experi- 
ments mentioned in this section are performed on the Blue- 
Crystal High Performance Computing Clusteij^ Having achieved 
a performance of 28.4 TFlop/s, BlueCrystal was placed 86th 
in the Top500 in June, 20080 

6.2 Matrix Multiplication Experiments 

6.2.1 Dataset Construction 

With the help of MapReduce, a large matrix A G R'"'<" 
can be generated in parallel using the matrix generator with 
the MapReduce algorithm given in Algorithm |5] and Algo- 
rithm |6] where 5 is the sparsity defined by the fraction of 
non-zero elements in all elements. By default, m — n = 
2^^(5 = 2'^ 



Algorithm 5 Matrix Generator (Mapper) 
Require: matrix height m 

for i 1 to m do 
emitii, {}) 

end for 



Algorithm 6 Matrix Generator (Reducer) 
Require: row index i, sparseness 5, matrix width n 
row {} 
for 1 to n do 

if randomi) < 5 then 

row row U {randoraQ} 
end if 
end for 
emitii, row) 



In this group of experiments, the height and width are equal, 
and both are in a factor of 2 in order to demonstrate the po- 
tential non-linear relationship between experiment variables. 
The square matrix is also straightforward in the representa- 
tion of space and time complexity. 

The major advantage of generating a matrix by using MapRe- 
duce is the output matrix from a generator will be well 

^http:/ /hadoop. apache. org/ 
*https://www. acre. bris.ac.uk/ 
^https://www. acre. bris.ac.uk/acrc/hpc.htm 




Figure 2: Elapsed time wrt. the number of 
rows/columns plotted in Log-Log space. In this pic- 
ture, baseline are dashed lines through the first blue 
points with slopes = 3. Partitioning time and sum- 
mation time are also drawn along with the total 
elapsed time. 



balanced among all the computing nodes, which means the 
work load for each Mapper will not be significantly distin- 
guished. Therefore, the data locality can be enabled at the 
beginning. 

This generator is also employed for other purposes to gen- 
erate particular matrix (e.g. random initial values), which 
may have different requirements compared with the experi- 
ments listed here. Details are not listed in this paper. 

6. 2. 2 Computational Complexity w.r.t m 

Theoretically, for multiplication of two square matrices A, B G 
jjmxm^ the computational complexity is 0{m?) that dom- 
inates among all matrix operators. Implemented from the 
naive algorithm, the complexity of our operator remains the 
same. However, in most cases (particularly in Text Min- 
ing applications), the matrices are often extremely sparse 
although they usually have very high dimensions. Consider- 
ing this fact, the actual computational complexity should be 
much reduced than the theoretical complexity when sparse 
matrix algorithm is used. 

In this experiment, the performance of a matrix multiplica- 
tion is investigated through 6 tests, each of which performs 
a multiplication between two randomly generated squared 
matrices A,B = R'"^™. The parameters remain at their 
default settings. 

Figure 2 generally generally confirms the assumptions we 
have made earlier. Three lines are plotted on the graph 
which show the increasing trend of running time, partition 
time and summation time when the size of matrix increases. 
Beyond m = 131, 072 we reach the storage limit of the HPC 
cluster. As can be expected from the O(m^) complexity 
of matrix multiplication, all lines grow non-linearly with 
increasing slopes. The "Baseline" is plotted as a theoret- 
ical benchmark which illustrates the theoretical prediction 



Figure 3: There is an almost linear correlation be- 
tween elapsed time and the number of non-zero cells 
in operand matrices. 

of elapsed times for the each experiment. 

From Figure [2] we can see that the gap between the theoret- 
ical prediction and actual results are large. For all sections 
of the curves, slopes are lower than predicted. Even at the 
final stage of the curves, the slope is approximately 2 com- 
pared with 3 of the prediction baseline. This gap can be 
explained by the sparsity of input matrices for which a large 
number of zero cells are not stored and actually computed. 

Figure[2]also illustrates that the total running time is equally 
shared between the partitioning and calculation stage. 

6. 2. 3 Performance w.rt 5 

In order to discover the correlation between the elapsed time 
and the number of non-zero cells in A, B, we plot the Tsiapse 
vs sparsity in the Figure [3] The whole experiment is per- 
formed on two randomly generated square matrices where 

From this picture it can be seen that the elapsed time grows 
linearly when the number of non-zero cells increases. 

The linear pattern can also be explained when looking back 
at Figure [2] Since the elapsed time and the number of non- 
zero cells are both quadratically correlated with m, a linear 
relationship can be expected between these two variables. 

6.2.4 Performance w.rt Partitioning Function and Schema 

As has been mentioned earlier, choosing different partition- 
ing strategies can have significant impact on performance. 
The naive partitioner is developed for maximizing the us- 
age of data locality. This experiment demonstrates how 
much benefit is obtained by the naive partitioning strat- 
egy. We use the same settings employed in section |6.2.2[ 
However, the experiments in this section are repeated three 
times with different partitioning functions: a) hrand with 
partition schema (m — 20, n — 6,k — 20), b) hrand with 
partition schema (m = 40, n = 6, A; = 40) , c) h„aive with 
partition schema (m = 20, n = 6,k = 20). 
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Figure 4: Space and time performance of three dif- 
ferent partition strategies used in matrix multipli- 
cation. Partitioning time for two different random 
partitioners also illustrates on the first graph. 

From the two charts in Figure [4] we can see clearly that 
compared with the other two partitioning strategies, hnaivc 
enjoyed less intermediate results and smaller Teiapse on both 
pictures, as we have predicted in section |5.1.4[ The hnaive 
has a great potential in reducing the size of intermediate 
results for the usage of data locality. It is also not surprising 
that hrand wlth larger (m, n, k) leads to more intermediate 
results for the reasons stated in section 15.1.21 

6.2.5 Speedup Rate 

Considering the capacity of a single machine and the soft- 
ware available, only the relative speedup is adopted for the 
test in this experiment. From the graph in Figure [5] a clear 
trend of speedup rate can be seen. The speedups with three 
different 5 are all lower than the linear speedup which upper- 
bounds all the practical speedup according to Amdahl's law. 
On the matrix with 5 — , the linear speedup is almost 7 
when 8 workers are enabled. The other curves with 5 = 2"^" 
and 5 = 2~^^ also have similar speedup rates when the num- 
ber of working machines is less than 8. 

The reasons for the gap between the practical speedup and 
the theoretical upper bound can be various. Although the 
code we wrote in MapReduce fashion can be all parallelized, 
several maintenance operations may be conducted during 
the experiment, such as check-pointing and auto-balancing. 
Interestingly, we can observe that speedup for sparser ma- 
trices are lower than those with high density. This fact sug- 
gests that for small and sparse matrices, a small number of 
clusters can do the best job, while for larger clusters, the 
whole system is not saturated and the rest of the computing 
resources (e.g. memory and idle CPUs) are wasted. 

6.3 Experiments of Parallel Algorithms 

In this section, the correctness and parallel performance of 
our algorithm implementations are tested against a range of 
datasets. 

6.3.1 Datasets 
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Figure 5: Speedup rates of three different 5 choices. 
The ideal speedup is illustrated by the diagonal line. 



In considering two different experiment objectives, two groups 
of problem sets are chosen for different purposes. For cor- 
rectness experiments, we test our implementation against a 
series of small problem sets which are fit for both our par- 
allel implementation and sequential algorithms. For paral- 
lel experiments, different test sets are selected for different 
problems settings of algorithms. During the SVM experi- 
ments, a typical text classification problem is tested by using 
the Reuters Corpu^with various number of training exam- 
ples ranging from 12k to 192k, while for NMF, a random 
generated matrix is used for factorization process, and fi- 
nally, Wikipedia 2008 corpuij^which contains 5,716,808 com- 
pressed wiki-pages from Wikipedia English Website are used 
for PageRank link analysis. 

6.3.2 Experiments on SVM 

We use LibSVM [2] for comparison because it is widely 
used and has been tested against many publicly available 
datasets. In this experiment, it is set to the default model 
without using any optimization techniques. Table [l] shows 
accuracy comparison of LibSVM and the Multiplicative Mod- 
els. Linearly separable datasets are marked by *. As can 
be seen from the table shown in[l] for two linear separable 
problem sets, our implementation offers very close accuracies 
compared with those achieved by LibSVM. The best perfor- 
mance is reached on the IRIS dataset with 98.7% compared 
with 97.33% from LibSVM. Besides, for non-linear separable 
problems (Reuters-21578 and ADULT), our implementation 
still achieves a promising accuracy (69.3% and 87.2% respec- 
tively), although LibSVM enjoys a much higher accuracy for 
its soft-margin classification implementation. 

For parallel experiments, our SVM trainer will be trained 
on a series of subsets of Reuters Corpus for a fixed iteration 
numbeij^with increasing number of instances. The experi- 
ments start with 12K training examples, and end up with 
192K training examples. The size of kernel matrix gener- 
ated and training time for one iteration (one multiplicative 
process) is reported in table [2] 

®http: / /about. reuters.com/researchandstandards/corpus/ 
'^http:/ /www-connex. Iip6.fr/ denoyer/wikipediaXML/ 
* Stopping criteria for SVM are beyond the scope of the cur- 
rent paper. 



Table 1: Accuracy comparison with sequential Lib- 
svm 



Dataset 


Libsvm 


Multiplicative Models 


IRIS* 


97.33% 


98.7% 


ADULT 


82.1% 


69.3% 


WINE* 


98.5% 


95.2% 


HEART (Binary) 


97.6% 


95.2% 


Reuters-21578 


92.2% 


87.2% 



Table 2: Parallel performance of SVM on Reuters 
Corpus 



NO. 


Kernel 


Kernel 


T' 


Accuracy 


Training 


sparsity 


Matrix 






Examples 










12K 


0.82 


1.47 


33s 


82.5% 


24K 


0.83 


5.70 


45s 


80.3% 


48K 


0.86 


22.3 


66s 


74.2% 


96K 


0.87 


74.0 


180s 


72.4% 


192K 


0.86 


230 


420s 


70.0% 



The first two columns listed in table [2] illustrate the major 
issue that SVM sufTers from. Dense and large kernel ma- 
trix makes the SVM less attractive for large-scale problem 
sets. All kernel matrices that listed above have a very high 
density level (above 0.8). Moreover, the size of the ker- 
nel matrix also grows with the number of training examples 
quadratically. For sequential computing and traditional gra- 
dient descent, these kernel matrices are not acceptable for 
in-memory access. 

The last two columns report the elapsed time for each iter- 
ation and accuracy. Generally, our algorithm can obtain an 
acceptable accuracy when adopted on large scale of dataset 
within a reasonable time. Interesting patterns can be found 
in the last column where the accuracy is dropping as the 
number of instances increase. We assume this may be caused 
by our stopping criterion that terminates our algorithm at 
an immature stage, however, we can not find a very strong 
proof for our guess, and we believe this remains a tasks for 
future study. 

6.3.3 Experiments on NMF 

The aim of this set of experiments is to show our imple- 
mentation has gained the full capability of solving the large- 
scale matrix-factorization problem on HPC cluster. Since 
the update of W and H are symmetric, in this section, only 
performance of updating H will be discussed. 

The performance is measured by three different computa- 
tional components in our algorithm. For each component, 
four conditions under two categories will be considered. Sim- 
ilar to the evaluation of matrix multiplication, the effect of 
matrix sparsity is of interests. Moreover, the parameter k 
will also be considered under each sparsity setting. 

The first interesting pattern that can be found among the 
data listed in the table [S] is the elapsed time of computing 
X = W'^A dominates both computational costs in terms of 
both sparsity settings. The reason for this is that the matrix 
A is commonly larger than both W and H, and the parti- 



Table 3: Parallel performance of NMF on Random 
Matrices 



Computational 
Components 


5 = 2'' 
fc = 8 A; = 32 

Telapse Telapse 


S = 2-i" 
fc = 8 fc = 32 

Telapse Telapse 


X = A 
Y = W^WH 
H = H. *X./\' 


46 129 
23 60 
12 14 


20 30 
13 16 
11 12 


10° 
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Figure 6: PageRank plot of HarvardSOO and 
Hollins.edu respectively in descending order 

tioning may have a higher cost compared with others. For 
this reason, we can expect a drop of elapsed time when A 
becomes sparser. This expectation has been verified by the 
other half of our table, which shows, Teiapse reduces dramat- 
ically when the sparsity decreases to 2"^" . The performance 
of other components drops accordingly. The general trend 
in this table fits well with the results reported in [12| . 

However, it can also be observed that the computation of 
H = H. * X./Y is almost a constant, although the compu- 
tational complexity of this component should be 0{n). It 
suggests that the capacity of the current cluster is not satu- 
rated. In this case, it may be more eflicient to run on local 
machines than in a distributed environment. 

6.3.4 Experiments on PageRank 

We first test our implementation on two small datasets: a) 
Harvard500, which is a directed graph adjacent matrix gen- 
erated from 500 webpages crawled from Harvard University 
website, h) Hollions.edu, which is also a web matrix organ- 
ised from crawled web-content on Hollins.edu. As mentioned 
before, we use the Wikepedia 2008 Corpus as our parallel 
testing dataset. 

After adopting our PageRank calculation on both datasets, 
the obtained PageRank array is sorted in descent order, and 
plotted into Log-Log space in Figure [6] 

It can be noticed that the PageRank of Hollins.edu is smaller 
than Harvard500, since the number of webpages (i.e. row/column 
in web-matrix) in Harvard500 is much smaller than that in 
Hollins.edu. (According to the algorithm, the sum of all 
PageRank should be exactly 1. ) 

The PageRank for both datasets reduces linearly in LogLog 
space. It suggests that the PageRank of these two datasets 
conforms to certain logarithm distribution, and this assump- 
tion can be verified by two probability plots shown in Fig- 




Figure 7: Probability plot of PageRank for lognormal distribution 



ure[71 

From the two charts in Figure [7] it can be seen that Page- 
Rank on both datasets fits the lognormal distribution well 
except for samples on the left hand side, where the Page- 
Rank is lower than a certain value and remains equal to 
each other. This sign may suggest that the PageRank on 
both datasets has not fully converged for small values. 

In order to evaluate performance on real-world dataset, our 
algorithm is executed on the Wikipedia dataset in parallel. 
By using 30 nodes, our implementation finishes 100 iter- 
ations within 1 hour, and provides us with the following 
results shown in Figure [S] 

The two graphs plotted in Figure [S] show different patterns 
compared with those generated on small datasets. In Log- 
Log space, the PageRank falls suddenly after the 10®*'', 
which implies that there is a small collection (approx. 2 ~ 
3 X 10® ) of Wikipedia entries not connected with the most 
regular entries below a certain level of PageRank (approx. 
10~^). The probability distribution shows that top ranks 
also conform to Lognormal Distribution, however, it can be 
noticed that there is a large PageRank shift for the small- 
est ranked group of entries. It can be regarded as another 
interpretation of the pattern on the left chart in Log-Log 
space. 

Another interesting result is reported in the table |4] which 
lists the top 10 results sorted by PageRank. 

Although Wikipedia does not release its official popularity 
ranking, measured by our daily life experience, a reasonable 
popularity list seems to be produced by our PageRank im- 
plementation. However, it also gives an arguable rank of 
entry Geographic Coordinate System. 

Generally, as the PageRank of webpage decreases, its num- 
ber of inlinks also drops. The only exception is the entry 
Wiktionary, which is ranked 5th, while its inlinks number 
fails to list in the Top 100 entries sorted by No. inlinks, 
thus it is marked by "< 29076" (where 29076 is the smallest 
inlink number among Top 100). 



Table 4: Top-10 ranked Wikipedia Entries with their 
PageRank and NO. inlinks 



Wikipedia Entry 


PageRank 


NO. Inlinks 


United States 


0.0021 


374934 


2007 


0.0014 


266614 


2008 


0.0014 


286409 


United Kingdom 


0.0011 


139325 


Wiktionary 


0.0009 


< 29076 


Geographic coordinate system 


0.0009 


294604 


2006 


0.0009 


146336 


Wikimedia Commons 


0.0008 


70096 


English language 


0.0006 


69408 


Germany 


0.0006 


95366 



7. CONCLUSION AND FUTURE WORKS 

In this paper, three machine learning algorithms: Non-Negative 
Matrix Factorization, Support Vector Machines, and Page- 
Rank have been successfully adapted to the new parallel 
paradigm MapReduce for large-scale problems. 

To achieve this goal, two generic multiplicative components 
are extracted from the solutions to these problems. We fur- 
ther discussed the unique features of these models concern- 
ing different parallel settings. A general configuration of 
"Partition Schema" has been introduced to describe the par- 
tition strategies under different sparsity /density settings and 
various scales of clusters. 

Our main contribution in this study is to propose a generic 
model that is efficient for upscaling a set of learning algo- 
rithms, particularly those involving distances and similarity 
measures, and iterative multiplicative updates. Compared 
with upscaling different algorithms individually, we believe 
our approach provides one general solution to the large-scale 
learning problems. 

Analysis from our experiments shows that all our models 
are correctly adapted and successfully applied to three algo- 
rithms. For partitioned matrix multiplication, we observed 
a encouraging linear speedup and reduced computational 




Figure 8: PageRank plot and probability plot on Wikipedia Corpus 



time compared with theoretical prediction. An optimiza- 
tion brought by partitioning schema has also been discov- 
ered. Our experiments also show that the density of matrices 
being multiplied is a crucial factor for speedup rate. 

Three learning algorithms have also been tested for correct- 
ness and performance. Results show NMF also achieves a 
good speedup rate, especially for the portion of computa- 
tion related to similarity comparison multiplication. SVM 
achieves a comparable performance to the mainstream SVM 
toolkit LibSVM while also revealing good scalability on ex- 
tremely large kernel matrices. The PageRank implementa- 
tion was successfully applied on the Wikipedia corpus with 
5,716,808 entries for 100 iterations within 1 hour, and shows 
encouraging results which are "empirically" good. 

Several lines of future work suggest themselves. Limited by 
the simple functional representation of MapReduce, we can- 
not at the moment implement more sophisticated algorithms 
which may involve instant machine communications. Sim- 
ply, the "online version" of MapReduce may be worth consid- 
ering for the future development on MapReduce structure. 
Once the "online" version has been implemented, a large col- 
lection of stream-based algorithms especially involving real- 
time updating techniques may become possible targets on 
MapReduce. Fortunately, a prototype "MapReduce Online" 
has been designed in jo], and a Hadoop versiorQis also avail- 
able. 

To conclude, both theoretical and experimental results show 
that our generic approach is widely applicable for upscaling 
similarity and iterative-based learning problems on MapRe- 
duce and reveals potential on data sets with various scales. 
Performance may be further improved by setting up a fitting 
partition schema. 

8. OPEN SOURCE IMPLEMENTATION 

The techniques discussed in this paper forms the prototype 
of project Big02 |http://code. google. com7p/bigo2/| . 
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